# Text-Based Ideal Points Model (Code and Data) for US House, Sessions 115 and 116 (2017-2021).

## This data and code use TBIP to estimate ideal points for legislators from their texts, and also estimate vote-based ideal points within the same framework, enabling comparison of the ideological positioning and expression of members of Congress across three venues: _Floor Speeches_, _Twitter Tweets_, and _Roll-Call Votes_. 

**First, please read the README at the original TBIP repo (https://github.com/keyonvafa/tbip) for the required overview, and install the required libraries (`pip install -r requirements.txt`) in your environment. It is important to understand the data files and basic structure of the original TBIP code, because this directory builds directly on it. Please cite the TBIP paper by Vafa et al. if using this software: Vafa, K., Naidu, S., & Blei, D. (2020, July). Text-Based Ideal Points. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 5345-5357).** 

## Data
---

There are three datasets provided, one for each of the venues (audiences) studied in this work:  
  
  1. **floor_speeches_congs_115_116**
  2. **tweets_cong_115_116**
  3. **congs_115-116_votes**

Below, we highlight the _pipeline used for each of these datasets to convert raw text or vote data to ideal point estimates_. The ideal point estimates along with a host of information about the legislators are then combined and presented in our main file used for analyses conducted in the paper (`../legislator_info_and_tbip_congresses_115_and_116.csv`); **for combining all results and creating that file, please view the documented code in the main project directory (`../combine_and_create_main_file_for_conducting_research.ipynb`)**. 

NOTE: For source/reference information on the raw data that serve as the starting point of each ideal point estimation process, please view the README in the corresponding `data/` subdirectory. 

## Process for deriving ideal point estimates as well as various ideological topic modeling estimates from raw data: 


Note: make sure jupyter notebook can access the python 3.6 conda environment by running the following: 

```
$ conda install ipykernel
$ ipython kernel install --user --name=<your_env_name>
```

### Floor Speeches
---

**Step 1:** 

Input: Raw data file for floor speeches derived from the Congressional Record: `data/floor_speeches_congs_115_116/raw_original_data_floor_speeches_house.csv` 

Output: Processed files in `data/floor_speeches_congs_115_116/clean/`

Process: Run the script: `data/floor_speeches_congs_115_116/preprocess_floor_speeches_and_convert_to_bag_of_words.ipynb` 
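As a rough illustration of what "convert to bag of words" means in this step, the sketch below builds a vocabulary and per-document word counts from raw speech strings. The tokenizer, the `min_doc_freq` threshold, and the function names are hypothetical; the notebook's actual preprocessing and output files may differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def build_bag_of_words(documents, min_doc_freq=2):
    """Return (vocabulary, counts), where counts[d] maps word index -> count.

    min_doc_freq is a hypothetical threshold: words that appear in fewer
    documents are dropped from the vocabulary.
    """
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vocabulary = sorted(w for w, df in doc_freq.items() if df >= min_doc_freq)
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = [
        dict(Counter(index[w] for w in tokens if w in index))
        for tokens in tokenized
    ]
    return vocabulary, counts


# Toy speeches, invented for illustration only.
speeches = [
    "Mr. Speaker, I rise to support the health care bill.",
    "This health care bill raises taxes on families.",
    "I rise to honor the families of our veterans.",
]
vocabulary, counts = build_bag_of_words(speeches)
print(vocabulary)
print(counts[0])
```

Each document becomes a sparse mapping from vocabulary index to count, which is the same information a sparse document-by-word count matrix stores.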


**Step 2:** 

Input: Processed data files in `data/floor_speeches_congs_115_116/clean/`

Output: Processed floor speech data after removing procedural speeches in: `data/floor_speeches_congs_115_116/clean_removing_procedural/`

NOTE: This data cleaning step is specific to floor speeches, since many floor speeches are low on content and high on procedural legislative jargon. Both the logic of the code and the intuition behind removing such speeches come from: **Card, Dallas, Serina Chang, Chris Becker, Julia Mendelsohn, Rob Voigt, Leah Boustan, Ran Abramitzky, and Dan Jurafsky. "Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration." Proceedings of the National Academy of Sciences 119, no. 31 (2022): e2120510119.** 

Process (run from the `data/floor_speeches_congs_115_116/` subdirectory): 

1. Run the following script: `python filter_procedural.py --input_fpath clean/raw_documents.txt --output_fpath raw_documents_without_procedural.txt`

2. Then, using the txt file created above, run `preprocessing_raw_speeches_after_procedural_speech_removal.ipynb` to create the files in `data/floor_speeches_congs_115_116/clean_removing_procedural/`

3. Finally, create a JSON file for the vocab (useful for the next step) by running the Python script: `clean_removing_procedural/vocab_txt_to_json.py`
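For orientation, a vocabulary conversion of this kind can be sketched as below. This is not the contents of `vocab_txt_to_json.py`: it assumes a one-token-per-line text file and a token-to-index JSON mapping, and the exact JSON layout the downstream code expects should be checked against the actual script.

```python
import json
import os
import tempfile


def vocab_txt_to_json(txt_path, json_path):
    """Write {token: row_index} for a one-token-per-line vocabulary file.

    The token -> index layout is an assumption made for this sketch.
    """
    with open(txt_path, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f if line.strip()]
    mapping = {token: i for i, token in enumerate(tokens)}
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(mapping, f, ensure_ascii=False, indent=2)
    return mapping


# Demonstration on a throwaway file rather than the real vocabulary.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "vocabulary.txt")
    json_path = os.path.join(tmp, "vocabulary.json")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("health care\ntax\nveterans\n")
    mapping = vocab_txt_to_json(txt_path, json_path)
    with open(json_path, encoding="utf-8") as f:
        reloaded = json.load(f)
```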


**Step 3.1 (can be run in parallel with 3.2):** (takes a long time to run, likely about a day or more on CPU)

Input: Processed floor speech data after removing procedural speeches in: `data/floor_speeches_congs_115_116/clean_removing_procedural/`

Output: Poisson factorization topic modeling output results stored in 10 subdirectories of the form: `data/floor_speeches_congs_115_116/pf-fits-removed-procedural-speeches-k50-seed*`

Process: Run (from `poisson_scripts`)  - `bash floor_speeches_congs_115_116.sh`

NOTE: Poisson factorization is run ten times with different random seeds to get an expected mean value in order to scale MALLET topic modeling output: this scaling is needed in order to use MALLET topic modeling results as input to text-based ideal point estimation. 
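One way to picture the scaling idea: MALLET produces normalized proportions, whereas Poisson factorization produces unnormalized positive intensities, so the MALLET output can be multiplied by a constant until its mean matches the average mean across the Poisson factorization runs. The sketch below illustrates that idea on toy numbers. It is an assumption about the general approach, not a reproduction of the repo's scaling script, and all values and names are hypothetical.

```python
from statistics import mean


def matrix_mean(matrix):
    """Mean of all entries in a list-of-lists matrix."""
    return mean(value for row in matrix for value in row)


def scale_to_target_mean(matrix, target_mean):
    """Multiply every entry by one constant so the matrix mean equals target_mean."""
    factor = target_mean / matrix_mean(matrix)
    return [[value * factor for value in row] for row in matrix]


# Hypothetical document-topic intensities from three Poisson factorization seeds
# (the pipeline uses ten; real values would come from the pf-fits-* directories).
pf_theta_by_seed = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.6, 0.6], [0.4, 0.4]],
    [[1.0, 0.2], [0.2, 1.0]],
]
# Hypothetical MALLET document-topic proportions (each row sums to 1).
mallet_theta = [[0.75, 0.25], [0.5, 0.5]]

# Average the per-seed means, then rescale the MALLET matrix to that target.
target = mean(matrix_mean(theta) for theta in pf_theta_by_seed)
scaled_theta = scale_to_target_mean(mallet_theta, target)
```

Because a single constant is applied, the relative topic weights within each document are unchanged; only the overall magnitude moves.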

**Step 3.2 (can be run in parallel with 3.1):** 

Input: Processed floor speech data after removing procedural speeches in: `data/floor_speeches_congs_115_116/clean_removing_procedural/`

Output: MALLET topic modeling output files in `data/floor_speeches_congs_115_116/mallet_fits_removed_procedural_speeches/`

Process:

1. Get the soup-nuts package to run MALLET topic model from: https://github.com/ahoho/topics (follow the instructions); and cite the following: 

`Hoyle, Alexander, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber, and Philip Resnik. "Is automated topic model evaluation broken? the incoherence of coherence." Advances in neural information processing systems 34 (2021): 2018-2033.`

Specific instructions to follow (from `tbip/`): 

```
$ git clone -b dev https://github.com/ahoho/topics.git --recurse-submodules
$ cd topics/

# Download the MALLET tar file from https://github.com/mimno/Mallet/archive/refs/tags/v2.0.8RC3.tar.gz and rename it to mallet-2.0.8.tar.gz

$ tar -xzvf mallet-2.0.8.tar.gz
$ ant  # run from inside the Mallet directory
$ conda create -n mallet python=3.8
$ conda activate mallet
$ pip install gensim==3.8.3
$ pip install configargparse
$ pip install pandas
$ pip install tqdm
```

2. Adjusting file paths accordingly, run (from `tbip`): `python topics/soup_nuts/models/gensim/lda.py --input_dir data/floor_speeches_congs_115_116/clean_removing_procedural/ --model mallet --output_dir data/floor_speeches_congs_115_116/mallet_fits_removed_procedural_speeches --train_path counts.npz --eval_path counts.npz --vocab_path vocabulary.json --num_topics 50 --optimize_interval 10 --workers 8 --mallet_path topics/mallet-2.0.8/bin/mallet`

Note: your mallet path might be `topics/Mallet-2.0.8RC3/bin/mallet` instead!


**Step 4:**

Input: MALLET topic modeling output files in `data/floor_speeches_congs_115_116/mallet_fits_removed_procedural_speeches/`

Output: Scaled MALLET topic modeling output files for use in subsequent ideal point estimation, also stored in `data/floor_speeches_congs_115_116/mallet_fits_removed_procedural_speeches/`

Process: Run the following script: `python setup/scale_mallet_output_using_poisson_factorization_runs.py --base_dir data/floor_speeches_congs_115_116/ --glob_pattern "pf-fits-removed-procedural-speeches-k50-seed*" --input_mallet_dir data/floor_speeches_congs_115_116/mallet_fits_removed_procedural_speeches/ --beta_fname beta.npy --theta_fname doctopics.txt`
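Before running the scaling script, it can help to confirm that the glob pattern matches all ten Poisson factorization runs. The sketch below checks this against a hypothetical directory listing; the seed suffixes are invented for illustration, and in practice the listing would come from `os.listdir` on the base directory.

```python
from fnmatch import fnmatch

pattern = "pf-fits-removed-procedural-speeches-k50-seed*"

# Hypothetical directory listing; the actual seed suffixes may differ.
entries = [
    "clean_removing_procedural",
    "mallet_fits_removed_procedural_speeches",
] + [f"pf-fits-removed-procedural-speeches-k50-seed{seed}" for seed in range(10)]

# Keep only the entries that the glob pattern would pick up.
matched = sorted(name for name in entries if fnmatch(name, pattern))
if len(matched) != 10:
    raise RuntimeError(f"expected 10 Poisson factorization runs, found {len(matched)}")
```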


**Step 5:**

Input: `data/floor_speeches_congs_115_116/clean_removing_procedural/` and `data/floor_speeches_congs_115_116/mallet_fits_removed_procedural_speeches/`

Output: `data/floor_speeches_congs_115_116/tbip-pytorch-fits-og-rem-procedural-speeches-k50-init-mallet/`

Process: Run the script (from `tbip_scripts`): `bash floor_speeches_congs_115_116.sh`

NOTE: We highly recommend running the above script on a GPU device rather than on CPU. CPU will take a long time (multiple months!). 
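For readers new to TBIP, the core of the model in the Vafa et al. paper is a Poisson rate for each document-word count: the sum over topics of document intensity times neutral topic-word intensity times `exp(ideal_point * ideological_topic_word_term)`. The sketch below evaluates that rate for a single word with toy numbers. The function and variable names are hypothetical, and this is not the training code run by the script above.

```python
import math


def tbip_rate(theta_d, beta_v, eta_v, ideal_point):
    """Poisson rate for one (document, word) pair under the TBIP model.

    theta_d:     per-topic intensities for the document (length K)
    beta_v:      per-topic neutral intensities for the word (length K)
    eta_v:       per-topic ideological adjustments for the word (length K)
    ideal_point: scalar ideal point of the document's author
    """
    return sum(
        theta * beta * math.exp(ideal_point * eta)
        for theta, beta, eta in zip(theta_d, beta_v, eta_v)
    )


# Toy values for a two-topic model.
theta_d = [2.0, 0.5]
beta_v = [0.3, 0.1]
eta_v = [1.0, -1.0]

neutral = tbip_rate(theta_d, beta_v, eta_v, ideal_point=0.0)
positive = tbip_rate(theta_d, beta_v, eta_v, ideal_point=1.0)
negative = tbip_rate(theta_d, beta_v, eta_v, ideal_point=-1.0)
```

With an ideal point of zero the ideological term drops out and the rate reduces to the plain topic-model product, so the document and topic intensities act as the neutral starting point that the ideal points then shift.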


**Step 6:**

Input: Host of subdirectories present in `data/floor_speeches_congs_115_116/`

Output: Files stored in `../speeches_results/` 

Process: Run the code in the notebook: `analysis/analyze_floor_speeches_ideal_points.ipynb`



### Twitter Tweets
---

**Step 1:** 

Input: Raw data file of (subsampled) tweet texts for representatives in the US House (Congresses 115 and 116), extracted via the Twitter API using the available Twitter user IDs of members: `data/tweets_cong_115_116/all_tweets_df.csv` 

Output: Processed data files in `data/tweets_cong_115_116/clean2/`

Process: Run the script: `data/tweets_cong_115_116/preprocess_tweets_and_convert_to_bag_of_words.ipynb`

NOTE: The full data extracted from Twitter had upward of 3 million tweets, which is computationally intractable for the models used. So we downsample to about 300k tweets while preserving the number-of-tweets-per-author distribution. Our script for downsampling while preserving the author distribution is provided at `setup/filter_and_sample_cong_twitter_data.py` for instructional purposes. 
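The idea of downsampling while preserving the per-author distribution can be sketched as sampling the same fraction of tweets from every author. The snippet below is a minimal illustration on toy data, assuming a fraction of about 0.1 (roughly 300k out of 3 million) and a floor of one tweet per author. The function name and the floor are hypothetical, and the provided script may work differently.

```python
import random
from collections import defaultdict


def downsample_per_author(tweets, fraction, seed=0):
    """Keep roughly `fraction` of each author's tweets (at least one per author).

    tweets: list of (author_id, text) pairs.
    """
    rng = random.Random(seed)
    by_author = defaultdict(list)
    for author_id, text in tweets:
        by_author[author_id].append(text)

    sampled = []
    for author_id, texts in by_author.items():
        # Same sampling rate for every author keeps relative volumes intact.
        keep = max(1, round(len(texts) * fraction))
        sampled.extend((author_id, text) for text in rng.sample(texts, keep))
    return sampled


# Toy data: author "A" tweets ten times as often as author "B".
tweets = [("A", f"a{i}") for i in range(100)] + [("B", f"b{i}") for i in range(10)]
subset = downsample_per_author(tweets, fraction=0.1)
```

Author "A" still contributes ten times as many tweets as author "B" after sampling, which is the property the note describes.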


**Step 2.1 (can be run in parallel with 2.2):** (takes a long time to run, likely about a day or more on CPU; needs a high-memory machine, likely with > 16 GB RAM)

Input: Processed data files in `data/tweets_cong_115_116/clean2/`

Output: Poisson factorization topic modeling output results stored in 10 subdirectories of the form: `data/tweets_cong_115_116/pf-fits-expanded-vocab-k50-seed*`

Process: Run (from `poisson_scripts`)  - `bash tweets_cong_115_116.sh`

NOTE: Poisson factorization is run ten times with different random seeds to get an expected mean value in order to scale MALLET topic modeling output: this scaling is needed in order to use MALLET topic modeling results as input to text-based ideal point estimation. 


**Step 2.2 (can be run in parallel with 2.1):** 

Input: Processed data files in `data/tweets_cong_115_116/clean2/`

Output: MALLET topic modeling output files in `data/tweets_cong_115_116/mallet_results/tbip_expanded_preprocessing_k50/`

Process (if you completed the floor speeches setup above, this should already be in place, with the Python 3.8 conda environment activated, etc.):

1. Get the soup-nuts package to run MALLET topic model from: https://github.com/ahoho/topics (follow the instructions); and cite the following: 

`Hoyle, Alexander, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber, and Philip Resnik. "Is automated topic model evaluation broken? the incoherence of coherence." Advances in neural information processing systems 34 (2021): 2018-2033.`

Specific instructions to follow (from `tbip/`): 

```
$ git clone -b dev https://github.com/ahoho/topics.git --recurse-submodules
$ cd topics/

# Download the MALLET tar file from https://github.com/mimno/Mallet/archive/refs/tags/v2.0.8RC3.tar.gz and rename it to mallet-2.0.8.tar.gz

$ tar -xzvf mallet-2.0.8.tar.gz
$ ant  # run from inside the Mallet directory
$ conda create -n mallet python=3.8
$ conda activate mallet
$ pip install gensim==3.8.3
$ pip install configargparse
$ pip install pandas
$ pip install tqdm
```

2. Adjusting file paths accordingly, run: `python topics/soup_nuts/models/gensim/lda.py --input_dir data/tweets_cong_115_116/clean2/ --model mallet --output_dir data/tweets_cong_115_116/mallet_results/tbip_expanded_preprocessing_k50 --train_path counts.npz --eval_path counts.npz --vocab_path vocabulary.json --num_topics 50 --optimize_interval 10 --workers 8 --mallet_path topics/mallet-2.0.8/bin/mallet`

Note: your mallet path might be `topics/Mallet-2.0.8RC3/bin/mallet` instead!


**Step 3:** 

Input: MALLET topic modeling output files in `data/tweets_cong_115_116/mallet_results/tbip_expanded_preprocessing_k50/`

Output: Scaled MALLET topic modeling output files for use in subsequent ideal point estimation, also stored in `data/tweets_cong_115_116/mallet_results/tbip_expanded_preprocessing_k50/`

Process: Run the following script: `python setup/scale_mallet_output_using_poisson_factorization_runs_twitter.py --base_dir data/tweets_cong_115_116/ --glob_pattern "pf-fits-expanded-vocab-k50-seed*" --input_mallet_dir data/tweets_cong_115_116/mallet_results/tbip_expanded_preprocessing_k50/ --beta_fname beta.npy --theta_fname doctopics.txt` 


**Step 4:** 

Input: `data/tweets_cong_115_116/clean2/` and `data/tweets_cong_115_116/mallet_results/tbip_expanded_preprocessing_k50/`

Output: `data/tweets_cong_115_116/tbip-og-k50-expanded-vocab-with-mallet-scaled-topics/`

Process: Run the script (from `tbip_scripts`): `bash tweets_cong_115_116.sh`

NOTE: We highly recommend running the above script on a GPU device rather than on CPU. CPU will take a long time (multiple months!).


**Step 5:** 

Input: Host of subdirectories present in `data/tweets_cong_115_116/`

Output: Files stored in `../tweets_results/`

Process: Run the code in the notebook: `analysis/analyze_tweet_ideal_points.ipynb`



### Roll-Call Votes
---

**Step 1:** 

Input: Raw data files in `data/congs_115-116_votes/raw/` 

Output: Processed data files in `data/congs_115-116_votes/clean/`

Process: Run the script: `python setup/preprocess_house_congs_votes.py` 


**Step 2:** 

Input: Processed data files in `data/congs_115-116_votes/clean/`

Output: Vote-based ideal point estimates (derived using variational inference) in `data/congs_115-116_votes/fits/`

Process: Run the script (from `tbip_scripts`): `bash estimate_vote_ideal_points_congs_115_116.sh`
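As background, a standard one-dimensional ideal point model for roll-call votes treats each vote as a Bernoulli draw with probability `sigmoid(popularity + ideal_point * polarity)`. The sketch below evaluates that likelihood on toy data. Whether the script above uses exactly this parameterization is an assumption, and the sketch only scores given parameters; it does not perform the variational inference used to estimate them.

```python
import math


def vote_probability(ideal_point, polarity, popularity):
    """P(yes vote) = sigmoid(popularity + ideal_point * polarity)."""
    return 1.0 / (1.0 + math.exp(-(popularity + ideal_point * polarity)))


def log_likelihood(votes, ideal_points, polarities, popularities):
    """Bernoulli log-likelihood of observed votes.

    votes: list of (legislator_index, bill_index, vote) with vote in {0, 1}.
    """
    total = 0.0
    for i, j, vote in votes:
        p = vote_probability(ideal_points[i], polarities[j], popularities[j])
        total += math.log(p) if vote == 1 else math.log(1.0 - p)
    return total


ideal_points = [-1.0, 1.0]   # two hypothetical legislators
polarities = [2.0]           # one hypothetical bill
popularities = [0.0]
party_line = [(0, 0, 0), (1, 0, 1)]   # legislator 0 votes no, legislator 1 votes yes
crossover = [(0, 0, 1), (1, 0, 0)]    # the reverse pattern

ll_party_line = log_likelihood(party_line, ideal_points, polarities, popularities)
ll_crossover = log_likelihood(crossover, ideal_points, polarities, popularities)
```

Votes that line up with the legislators' ideal points receive a higher log-likelihood than the reversed pattern, which is the signal an estimation procedure uses to place legislators on the scale.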

### Final Step
---

To produce the intermediate file with the TBIP values used in the research, run the notebook in the main directory: `combine_and_create_main_file_for_conducting_research.ipynb`

NOTE: Running the notebook above requires a newer version of pandas (pandas >= 1.2) than the one supported by Python 3.6 and used in the earlier steps. Please use a different Python environment and install pandas >= 1.2, openpyxl, and numpy in order to run it. 
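A quick way to confirm that the active environment satisfies the pandas requirement is a small version check such as the one below. The helper names are hypothetical, and the check only compares the major and minor version numbers.

```python
import re


def major_minor(version):
    """Return (major, minor) from a version string such as '1.2.4' or '1.3.0rc1'."""
    matches = (re.match(r"\d+", piece) for piece in version.split("."))
    numbers = [int(m.group()) for m in matches if m]
    return tuple(numbers[:2])


def meets_minimum(version, minimum=(1, 2)):
    """True if the version is at least the given (major, minor) minimum."""
    return major_minor(version) >= minimum


try:
    import pandas

    status = "ok" if meets_minimum(pandas.__version__) else "too old"
    print("pandas", pandas.__version__, status)
except ImportError:
    print("pandas is not installed in this environment")
```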


#### Human (Political Science Grad Students) Validation of unsupervised Topic Modeling (Pre-TBIP-estimation) and TBIP outputs
---

`analysis/human_annotation_files/` contains the code used to create the annotation files, the instructions PDF sent to annotators alongside the xlsx files, the annotation results, and the analysis code for computing agreement, etc. 
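This README does not say which agreement statistic the analysis code uses. As one common choice for two annotators, Cohen's kappa can be computed as in the sketch below; the labels and annotator lists are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four items.
annotator_1 = ["left", "left", "right", "right"]
annotator_2 = ["left", "right", "right", "right"]
kappa = cohens_kappa(annotator_1, annotator_2)
```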